An introduction to the package {targets}
Use it to coordinate your data analysis projects
🧑💻
🎯
code = results 🔬
“{targets} implicitly nudges users toward a clean, function-oriented programming style that fits the intent of the R language”
Tip
Quick plug: you can use my package to create a pre-populated R project directory!
More information here: https://github.com/JT-39/dau-R-template-ext
The project directory could look something like…
source(here::here("src/R/functions.R"))
# Path to absence data
absence_data_file_path <- here::here("_data/raw/1_absence_3term_nat_reg_la.csv")
# Extract national absence and format date
df_nat_absence <- get_nat_absence_data(absence_data_file_path) |>
  format_time_period()
# Fit a linear model
model <- fit_model(df_nat_absence)
# Plot the data and model
plot_model(model, df_nat_absence)

# Pull national absence from file
get_nat_absence_data <- function(file_path) {
  read.csv(file = file_path) |>
    dplyr::filter(geographic_level == "National")
}
# Extract the start year from academic year
extract_year <- function(date) {
  paste0(substr(date, 1, 4))
}
# Format the year as a date
format_time_period <- function(data) {
  data |>
    dplyr::mutate(
      Date = lubridate::year(as.Date(extract_year(time_period), format = "%Y")),
      .after = time_period
    )
}
# Fit the model and pull coefficients
fit_model <- function(data) {
  lm(sess_overall_percent_pa_10_exact ~ Date, data) |>
    coefficients()
}
# Round up to a multiple of five, leaving at least one unit of headroom
round_to_multiple_five <- function(x) {
  ceiling((x + 1) / 5) * 5
}
# Plot the data and model
plot_model <- function(model, data) {
  ggplot2::ggplot(data) +
    ggplot2::geom_point(ggplot2::aes(x = Date,
                                     y = sess_overall_percent_pa_10_exact,
                                     colour = school_type)) +
    ggplot2::geom_line(ggplot2::aes(x = Date,
                                    y = sess_overall_percent_pa_10_exact,
                                    colour = school_type)) +
    ggplot2::scale_colour_manual(values = kasstylesr::color_picker(4),
                                 breaks = c("Total", "State-funded primary",
                                            "State-funded secondary", "Special")) +
    # Draw the fitted line from the coefficients (intercept, slope)
    ggplot2::geom_abline(intercept = model[1],
                         slope = model[2],
                         show.legend = TRUE,
                         colour = "red",
                         linetype = "dashed") +
    # Label the line at its highest fitted value
    ggplot2::annotate("text",
                      x = max(data$Date),
                      # Refit on the `data` argument rather than a global object
                      y = lm(sess_overall_percent_pa_10_exact ~ Date, data) |>
                        fitted.values() |>
                        max(),
                      hjust = -0.45,
                      label = "Line of best fit",
                      colour = "red") +
    ggplot2::scale_y_continuous(breaks = scales::pretty_breaks(),
                                limits = function(x) {
                                  c(0, round_to_multiple_five(max(x)))
                                }) +
    ggplot2::coord_cartesian(clip = "off") +
    ggplot2::theme_minimal() +
    kasstylesr::kas_style() +
    ggplot2::labs(
      title = "Average persistent absence over time in England",
      subtitle = "Split by school type. Only includes persistent absentees.",
      x = "",
      y = "Overall absence rate (%)",
      colour = "School type:"
    )
}
tar_target()
name ~ the name of the target
command ~ the function that generates the target
> dispatched target data_file
o completed target data_file [0.42 seconds]
> dispatched target nat_data
o completed target nat_data [0.47 seconds]
> dispatched target nat_data_clean
o completed target nat_data_clean [0 seconds]
> dispatched target model
o completed target model [0 seconds]
> dispatched target plot
Saving 7 x 7 in image
o completed target plot [0.7 seconds]
> ended pipeline [3.46 seconds]
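To make the name/command pairing concrete, here is a minimal sketch of a small pipeline that builds in dependency order, much like the log above. It is an illustration only: the toy CSV, its made-up numbers, and the helpers `read_toy_data()` and `fit_toy_model()` are assumptions standing in for the real absence data and functions, and everything runs in a temporary directory via `tar_dir()`.

```r
library(targets)

# Run everything in a throwaway directory so no _targets/ store is left behind.
result <- tar_dir({
  # Toy stand-in for the absence CSV (made-up numbers, for illustration only).
  write.csv(
    data.frame(year = 2018:2021, rate = c(10, 12, 11, 14)),
    "toy_absence.csv",
    row.names = FALSE
  )

  # tar_script() writes the code below to _targets.R in the current directory.
  tar_script({
    library(targets)

    # Hypothetical helper functions, standing in for src/R/functions.R.
    read_toy_data <- function(path) read.csv(path)
    fit_toy_model <- function(data) coefficients(lm(rate ~ year, data))

    # Each tar_target() pairs a name with the command that builds it.
    list(
      tar_target(data_file, "toy_absence.csv", format = "file"),
      tar_target(nat_data, read_toy_data(data_file)),
      tar_target(model, fit_toy_model(nat_data))
    )
  }, ask = FALSE)

  tar_make()       # build the targets in dependency order
  tar_read(model)  # read the stored result back into the session
})

result
```

Because each command only refers to earlier targets by name, {targets} can work out the order itself; nothing in the list says which step runs first.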
{targets} creates a pipeline of pure functions
Target outputs are stored in _targets/objects/
Saved in .rds format
Add _targets/ to .gitignore (for GitHub) ☝️
Advanced
Declare data files as file targets (format = "file")
Otherwise {targets} will not track the data (any changes to it)
{targets} keeps track of changes in files and functions 🕵️
{targets} knows which parts of the pipeline can be run in parallel
Need to load a few more packages:
Utilise the {targets} function to run in parallel:
Simple as that! 💥
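As one illustration of parallel execution (an assumption, and not necessarily the packages or function these slides use), the sketch below hands a pipeline a local {crew} controller, which is the route the {targets} manual currently recommends. The targets `slow_a`, `slow_b`, and `total` are hypothetical; the two slow targets do not depend on each other, so they can be built on separate workers.

```r
library(targets)

# Run in a temporary directory so no _targets/ store is left behind.
total <- tar_dir({
  tar_script({
    library(targets)

    # Assumption: a local {crew} controller with two workers.
    tar_option_set(controller = crew::crew_controller_local(workers = 2))

    list(
      # Independent targets: eligible to run at the same time.
      tar_target(slow_a, { Sys.sleep(1); 1 }),
      tar_target(slow_b, { Sys.sleep(1); 2 }),
      # Depends on both, so it waits until they have finished.
      tar_target(total, slow_a + slow_b)
    )
  }, ask = FALSE)

  tar_make()
  tar_read(total)
})

total
```

With this setup the call that runs the pipeline stays the same; only the controller option changes.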
🔮
Add .Rmd or .qmd to the pipeline
{targets} computation 🔋
{sqltargets}, which applies {targets} principles to .sql files 📦
🎯
{targets} manual: Link
YouTube {targets} walkthrough: Link
Ofsted MI {targets} example pipeline GitHub: Link
These slides and mini {targets} example GitHub: Link
{sqltargets} GitHub: Link
Email me at:
jake.tufts@education.gov.uk